# Import Libraries
import pandas as pd # Dataframe object manipulation
import numpy as np # Array operations
import seaborn as sns # Plots
import matplotlib.pyplot as plt # Plotting
import pandas_profiling # Data exploration
# Modelling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from sklearn.metrics import plot_roc_curve
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from matplotlib import pyplot
# MLI
import lime
import lime.lime_tabular
# Read in the Dataset.
# Import file should be located on project folder
path = 'cs-training.csv'
data = pd.read_csv(path,
usecols = ['SeriousDlqin2yrs',
'RevolvingUtilizationOfUnsecuredLines',
'age',
'NumberOfTime30-59DaysPastDueNotWorse',
'DebtRatio',
'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans',
'NumberOfTimes90DaysLate',
'NumberRealEstateLoansOrLines',
'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents'])
Exploration of a dataset is paramount to a correct framing of our problem. Out objective is to predict an event of serious delinquency in the next 2 years based on 10 variables. In order to really understand our data and the vastness of the problem at hand we will focus on the next steps:
The Give Me Some Credit dataset has a recollection of important features regarding the potential credit risk of the Client. We have a total of 150.000 different clients available to user in our training excercise. The feature space for each client ranges from financial standing features, financial status to demographic.
Descriptive statistics of the dataset are as follows
# Standard pd description
(data.describe())
# PD.describe is a little lackluster
data.profile_report()
The Give Me Some credit competition support team kindly offered a dictionary for this dataset that is summarized for reader support
Regarding theoretical limits for variables, Utilization cannot be negative (unless overdraft). High Debt to Income Ratios should not be processed as outliers, when income becomes 0 debt to income limit approaches infinity. Negative incomes are data errors, possible outliers regarding demographic data
We will start visualization plots always taking into account our target feature: Loan Default in the next 2 years, as it is a part of a bank livelihood, credit default rates should be consistently imbalanced.
We have a total of 6.6% of the 150k population that has incurred in serious default events in two years. Given the nature of the feature and the dataframe I suspect this information is collected from a Mortgage portfolio.
The distribution of the target feature is presented as follows:
# Count values for Target feature
pd.DataFrame(data['SeriousDlqin2yrs'].value_counts(normalize=True))
# Histogram plot for target feature
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
labels = ['Default', 'Non-Default'] # Predicted Classes, default point may vary regarding local regulations
values = [(data[data['SeriousDlqin2yrs']==1].sum()['SeriousDlqin2yrs'])/150000, (150000-data[data['SeriousDlqin2yrs']==1].sum()['SeriousDlqin2yrs'])/150000] # Normalized by lenght of data 150.000
ax.bar(labels,values, color= ['red', 'green'], alpha=0.7)
plt.title('Figure 1. Distribution of Target Feature: SeriousDlqin2yrs')
plt.show()
As part of Portfolio theory, the goal of a risk manager is to select the optimal return for the associated risk on the investment. The bank provides loan applicants with the funds required for an especific purpose and in return it gets the interest on the money lent.[0]
A correct measurement of risk allows the bank to maximize its returns with an expected level of delinquency on its loans the goal of our model is to predict as accurate as possible this suggested level of delinquency in the next two years.
In addition to the target feature analysis, we provide expected signs for our feature space as well as visualizations for each feature in Annex(A).CreditDefaultEDA
The results from our EDA analysis are summarized on this table. All of this results come from a main question: What worries me about this feature?
| Feature Name | Expected Sign | Outlier Problems | Data Problems |
|---|---|---|---|
| SeriousDlqin2yrs | NA | NA | Unbalanced |
| RevolvingUtilizationOfUnsecuredLines | - | NA | Values > 1 suggest problems. Informative |
| Age | + | NA | Remove underage population |
| NumberOfTime30-59DaysPastDueNotWorse | - | NA | Nan Treatment, encode Values |
| DebtRatio | - | NA | NaN not associated to default, non reliable |
| MonthlyIncome | + | NA | NaN treatment, standarize |
| NumberOfOpenCreditLinesAndLoans | - | Cap at 10 | Standarize |
| NumberOfTimes90DaysLate | - | Encode, Relevant feature, possible leak | |
| NumberRealEstateLoansOrLines | + | ||
| NumberOfTime60-89DaysPastDueNotWorse | - | Encode | |
| NumberOfDependents | - |
The logic behind expected signs is as follows. Users in financial distress tend to start using their available resources to cover the leak in their financial distress. A high utilization at high rates suggest a troublesome set.For example, A low level of liquidity, low financial knowledge sophistication ( Highest borrowing rates in the market) or non expected financial windfall, among others.
As Warren buffet once said: " You cannot go through life borrowing at those rates" [1]
The logic behind age comes from dynamic Macroeconomic theory and the financial lifecycle. We expect (segregated by income) a Solow's type of income stock (Savings) were the bulk of the individual's life expenditure diminishes as life progresses,and income grows until it becomes static at retirement. [2]
Debt to income ratios suggest how much the incurred loans from the client are a burden to its day to day life, accurate income measuring its a problem in itself but when a client report 0 income to the bank, it is in the bank's best interest to offer reestructuring facilities to weather the crisis to a client. [3]
In practice however, a low debt to income ratio paradoxically has been reported to be associated to a decrease in payment rates as well. Small amounts of debt seem to easily escape the to-do list of high income segments of population.
Regarding Past-due Events our logic relies on Merton's modelling for financial assets. The closest you have been to default should weight on how likely you are to default in the future. Or, the standard deviation from expected returns should be priced proportionally to the expected return. A simple but elegant solution to asset pricing. [4]
This type of analysis are not possible on high dimensional feature spaces but from time to time, this low dimentional feature spaces give way for some leniency on the effect of theory on the expected results from our model
Identify clear Nan Values, replace with median. Median preferred due to robustness on data, and feasible value. The current state of missing values on the training set is displayed in the table below
pd.DataFrame(data.isna().sum(), columns = ['Missing Values']) # Diagnose Dataframe for missing values
# Create Boolean Features
data['MonthlyIncome_na'] = data['MonthlyIncome'].isna()
data['NumberOfDependents_na'] = data['MonthlyIncome'].isna()
# Input median on problem features
data['MonthlyIncome'] = data['MonthlyIncome'].fillna(data['MonthlyIncome'].median())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())
# Check data
pd.DataFrame(data.isna().sum(), columns=['Missing Values'])
As part of outlier treatment of variables we will clip outliers according to the interquartile range of the population
# Encode Outliers using plot above
data['RevolvingUtilizationOfUnsecuredLines_out'] = np.where(data['RevolvingUtilizationOfUnsecuredLines']>=2.14, 1, 0)
data['age_out'] = np.where(data['age'] >= 90, 1, 0)
data['NumberOfTime30-59DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime30-59DaysPastDueNotWorse'] > 13, 1, 0)
data['DebtRatio_out'] = np.where(data['DebtRatio'] > 2.94, 1, 0)
data['MonthlyIncome_out'] = np.where(data['MonthlyIncome'] > 23000, 1, 0)
data['NumberOfOpenCreditLinesAndLoans_out'] = np.where(data['NumberOfOpenCreditLinesAndLoans'] > 20, 1, 0)
data['NumberOfTimes90DaysLate_out'] = np.where(data['NumberOfTimes90DaysLate'] > 10, 1, 0)
data['NumberOfTime60-89DaysPastDueNotWorse_out'] = np.where(data['NumberOfTime60-89DaysPastDueNotWorse'] > 3, 1, 0)
After preprocessing the integrity of the information holds. No records were omitted during the cleaning process, just adjusting outliers to max values inside each feature distribution
The first approach towards model creation will be with a random Forest Classifier with the standard parameters assigned by scikit-learn.A Train/Test split at 70% of the population is defined to avoid overfitting. The model should optimize for classification metrics such as AUC, or the components of the F1-Score.
# Target and Feature creation
X = data[['RevolvingUtilizationOfUnsecuredLines',
'age',
'NumberOfTime30-59DaysPastDueNotWorse',
'DebtRatio',
'MonthlyIncome',
'NumberOfOpenCreditLinesAndLoans',
'NumberOfTimes90DaysLate',
'NumberOfTime60-89DaysPastDueNotWorse',
'NumberOfDependents'
]]
y = data['SeriousDlqin2yrs']
# Split the dataset between train and test, 0.30 split used, 42 for consistency (And tradition)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)
# Memory consumed by train dataframe
mem = X_train.memory_usage(index=True).sum()
print("Memory consumed by training set : {} MB" .format(mem/ 1024**2))
print('\n')
# Memory consumed by test dataframe
mem = X_test.memory_usage(index=True).sum()
print("Memory consumed by test set : {} MB" .format(mem/ 1024**2))
Before training our models, a quick memory sanity check is in order. The split for the training set consumes 8MB, and the Test split 3.43MB. Given the order of complexity of the models to be used, no memory shortage realted problems should arise.
Now, regarding the model, we will train a industry scikit-learn standard parameters model. More information available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
A pipeline object is defined in case further preprocessing techniques should arise in the future. The reported AUC score regarding this model is 0.8405 the confusion matrix is displayed below
# Pipeline object defined if further preprocessing should be added
pipeline = Pipeline([
('classifier', RandomForestClassifier(random_state = 42))])
pipeline.fit(X_train, y_train)
# Model training
preds = pipeline.predict(X_test)
np.mean(preds == y_test) # Accuracy
# Accuracy report
accuracy_score(preds, y_test)
print(classification_report(y_test, preds))
print('AUC Score of: ', roc_auc_score(y_test, pipeline.predict_proba(X_test)[:,1]))
We implement GridSearch cross validation as implemented on Sklearn.model_selection, the chosen grid of hyperparameters are defined as follows:
Even a small change upon the standard chosen hyperparameters help in the classification metrics as it can be seen on the ROC plot below.
# Cross validation grid definition
hyperparameters = { 'classifier__n_estimators': [20, 30, 50],
'classifier__max_depth': [2, 4],
'classifier__min_samples_leaf': [2, 4]}
# CV train
clf = GridSearchCV(pipeline, hyperparameters, cv = 3)
# Fit and tune model
clf.fit(X_train, y_train)
# Model testing with CV model
preds = clf.predict(X_test)
np.mean(preds == y_test) # Accuracy prediction
# Roc curve comparison between models
rfc_no_cv = plot_roc_curve(pipeline, X_test, y_test) # Standard sk-learn RF model
rfc_cv = plot_roc_curve(clf, X_test, y_test, ax=rfc_no_cv.ax_) # CV model
rfc_cv.figure_.suptitle("ROC curve comparison")
The now refined model displays better AUC.
# Classification report for CV model
print(classification_report(y_test, preds))
The model parameters are not the main objective of this article, nonetheless, the model displays a precision of 0.94 across the test set, with areas under the receiving operator curve of 0.85.
The relatively low f1 score relies entirely on the use-case. If this model is used entirely for accepting/rejecting loans, extreme care should be taken of not excluding a large percentage of the candidate population.
A common approach at explaining the results of a non-linear model, relies on plotting the Variable Importance for the whole population we will follow but also expand on the caveats of this approach.
# Feature importance of non cv model
pipeline.steps[0][1].feature_importances_
# Feature importance of CV model example of how to use model results within a CV/pipeline environment
clf.best_estimator_.steps[0][1].feature_importances_
# Feature importance of CV model for plotting
feature_imp = pd.Series(clf.best_estimator_.steps[0][1].feature_importances_,index=X.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index, color="salmon")
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
Altough variable Importance shows us an important insight, regulators require granularity regarding client offerings/refusals. In order to comply to the most strict regulations we need to be able to provide reason codes for each client refused by the bank. Lets take the worst top 10 predicted probabilities on our test set.
# Worst ids, predict proba loses the index for client
print(pd.DataFrame(clf.predict_proba(X_test)).head(5))
type(X_test)
# Find the index of the worst predicted probabilities
X_test_prob_df = pd.DataFrame(clf.predict_proba(X_test)) # Worst probability clients ordered only by input not user_id
X_test_df = X_test.reset_index() # Reset index to join de data with only order and not user index
X_test_merge = X_test_df.merge(X_test_prob_df, left_index=True, right_index = True) # Retrieve the probabilities for each client
# Top 10 worst predicted clients in test dataframe
top10_worst_clients = X_test_merge.sort_values(by=1, ascending = False).head(10)
top10_worst_clients
# Setup
predict_fn_rf = lambda x: clf.predict_proba(x).astype(float) # Function to retrieve record probabilities
X = X_train.values # Array of train dataframe
explainer = lime.lime_tabular.LimeTabularExplainer(X,feature_names = X_train.columns,class_names=[0,1],kernel_width=5) # Model explainer initialization
# Worst predicted client id
y_test.loc[[69189]]
# Applying the explainer to this client and show results
choosen_instance = X_test.loc[[69189]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
# Applying the explainer to this client and show results
choosen_instance = X_test.loc[[59128]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
# Applying the explainer to this client and show results
choosen_instance = X_test.loc[[62640]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
# Applying the explainer to this client and show results
choosen_instance = X_test.loc[[42081]].values[0]
exp = explainer.explain_instance(choosen_instance, predict_fn_rf,num_features=10)
exp.show_in_notebook(show_all=False)
As it has been observed, variable importance on classifiers is just not enough for the level of granularity required to convince regulators and managers of the benefits of using ML techniques in Credit Risk.
This has been a small example that an accurate, fully explained model is possible, under current economic/Finance theory and by explaining client by client the reasoning of the coded model, the benefits and potential insights across all of the studied population are inmense.
The benefits of using non-linear approaches to find clusters of explainable phenomena across a banks portfolio are exactly what is needed now in times of distress and global economic recovery.
To recapitulate what we've done we:
The latest part seemed the most difficult. LIME is still on its early releases, it is still not fully scalable and the GUI elements from the explainer are still a bit unflexible but the power of explaining at the record level is indeed impressive.
There are some caveats at linearizing the probability space across the joint distributions, the explanations are only as good as the model linearization is. Further research on the algorithm is still needed to fully scale the results for the tens of thousands of applications any small bank receives per day.
The H2O solution for these problems was the developement of K-Lime. Displaying the results of the R^2 for each cluster linearization. Other (also non scalable) technique for MLI is the use of Shapley Values, a game theoretical approach in finding the contribution of each feature to the added predicted probability.
Further improvements to be done on this implementation could be regarding the preprocessing pipeline. Feature Engineering (As long as it doesn't break interpretability) and a more extensive CV search across all possible and recurrent best parameters for classification techniques.
[0] https://www.investopedia.com/terms/m/modernportfoliotheory.asp
[1] https://www.cnbc.com/2020/05/13/warren-buffett-cautions-against-carrying-a-credit-card-balance.html#:~:text=%E2%80%9CYou%20can't%20go%20through,pay%20down%20debt%2C%20the%20better.&text=And%20once%20you're%20debt,have%20to%20owe%20interest%20again
[2] Swan, Trevor W. (November 1956). "Economic growth and capital accumulation". Economic Record. 32 (2): 334–361
[3] https://www.investopedia.com/terms/d/debtrestructuring.asp#:~:text=Debt%20restructuring%20is%20a%20process,a%20default%20on%20sovereign%20debt.